Project 1
DESCRIPTION
Reduce the time a Mercedes-Benz spends on the test bench.
Problem Statement Scenario: Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations, including the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2,000 patents per year, making it the European leader among premium carmakers. With a huge selection of features and options, customers can configure the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, Mercedes-Benz regards safety and efficiency as paramount on its production lines. However, optimizing the speed of the testing system across the many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.
You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. An optimal algorithm will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.
The following actions should be performed:
1. If the variance of any column(s) is zero, remove those column(s).
2. Check for null and unique values in the test and train sets.
3. Apply a label encoder.
4. Perform dimensionality reduction.
5. Predict the test_df values using XGBoost.
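The first three steps can be sketched on a toy frame before running them on the real data. The column names below are illustrative, not taken from the actual dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frame standing in for the training data
df = pd.DataFrame({
    'X0': ['a', 'b', 'a', 'c'],   # categorical column
    'X10': [0, 1, 0, 1],          # binary feature
    'X11': [0, 0, 0, 0],          # zero-variance column -> should be dropped
})

# drop numeric columns whose variance is zero
zero_var = [c for c in df.select_dtypes('number').columns if df[c].var() == 0]
df = df.drop(columns=zero_var)

# label-encode the remaining object (categorical) columns
for c in df.select_dtypes('object').columns:
    df[c] = LabelEncoder().fit_transform(df[c])

print(zero_var)               # ['X11']
print(df['X0'].tolist())      # [0, 1, 0, 2]
```

The same pattern applies to the full train and test frames, with the caveat that the encoder should be fitted on categories seen in both sets.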
import os
os.getcwd()
# use a raw string so the backslashes in the Windows path are not treated as escapes
os.chdir(r'D:\Data Science\SL Downloads\dataset\Merc Benz')
os.getcwd()
import pandas as pd
rawdata=pd.read_csv('./train/train.csv')
rawdata.head(5)
# List of int, float,object columns
Types = rawdata.dtypes.reset_index()
Types.columns = ["Count", "Column Type"]
Types.groupby("Column Type").count()
There are 8 categorical columns and the remaining are numeric.
rawdata.shape
rawdata.describe()
rawdata.isnull().sum()
import seaborn as sns
sns.heatmap(rawdata.isnull())
It looks like there are no null values in any column.
#rawdata['X0'].unique()
for i in rawdata.drop(['ID','y'], axis=1):
    print('Unique elements in.. ' + i + '-column')
    print(rawdata[i].unique())
    print('-----------------------------------------')
import matplotlib
import pandas_profiling as pp
pp.ProfileReport(rawdata)
Number of observations: 4209
Number of variables: 378
Variable types: BOOL 368, CAT 8, NUM 2
Null values: none
Target y: distinct count 2545
X11 and a few others can be neglected, because all their rows contain the same value. Many columns show high correlation.
rawdata.head()
## ANOVA check on categorical columns
import statsmodels.api as sm
from statsmodels.formula.api import ols
col = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
for i in col:
    model = ols('y ~ ' + i, data=rawdata).fit()
    print('Column : {}, F-statistic : {}, p-value : {}'.format(i, model.fvalue, model.f_pvalue))
Columns with high F-statistics can be kept. X4 has a p-value close to 0.05, so we fail to reject the null hypothesis that this column does not affect the target.
Types = rawdata.dtypes.reset_index()
Types.columns = ["Count", "Column Type"]
Types.groupby("Column Type").count()
# Numeric columns
numeric=Types[Types["Column Type"]=='int64'].Count
numeric
rawdata[numeric]
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
ftr = rawdata[numeric].drop(['ID'],axis=1)
trgt= rawdata.y
fs= SelectKBest(f_regression, k="all")
fs.fit(ftr,trgt)
scores = list(fs.scores_)
pvalues = list(fs.pvalues_)
fcols= list(ftr.columns)
scores[0:5]
pvalues[0:5]
fcols[0:5]
# List of tuples with feature and their importance
table = [(col, score, round (pvalue,4)) for col, score, pvalue in zip(fcols, scores, pvalues)]
print(table[0:5])
## sorting the tuples in descending order of p-value
table= sorted(table, key = lambda x: x[2], reverse = True)
print(table[0:5])
#Lets put them in a dataframe
newdf= pd.DataFrame(table, columns = ['Colname', 'fscore', 'pvalue'])
newdf
# There are some NaN values too
# Null hypothesis: these columns have no effect on the target
# If p-value < 0.05 we reject the null hypothesis; otherwise we fail to reject it
# So columns with p-values greater than 0.05 have no detectable effect on the target
# and can therefore be neglected
dropcols= newdf[newdf['pvalue'] > 0.05]
#dropcols=dropcols['Colname']
dropcols=dropcols.Colname.values
dropcols
# Lets check Nan values
newdf.isnull().sum()
# 12 columns have Null values
newdf[newdf['fscore'].isnull()]
rawdata['X11']
#rawdata['X11'].unique()
rawdata['X11'].sum()
## These columns all contain only the single value 0, as also highlighted by the pandas profiling report
## These columns can also be ignored.
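Rather than inspecting X11 by hand, single-valued columns can be found programmatically. A short sketch on a toy frame (column names illustrative):

```python
import pandas as pd

# toy frame: X11 and X93 each hold a single value throughout
df = pd.DataFrame({'X10': [0, 1, 1], 'X11': [0, 0, 0], 'X93': [0, 0, 0]})

# a column with exactly one unique value carries no information
single_valued = [c for c in df.columns if df[c].nunique() == 1]
print(single_valued)  # ['X11', 'X93']
```

Applied to `rawdata`, this list should match the zero-fscore columns flagged above.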
#One hot encoding
import pandas as pd
rawdata=pd.get_dummies(rawdata)
rawdata
# Collecting X and Y
X = rawdata.drop(['ID','y'],axis=1).values
Y = rawdata['y'].values
X
Y
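The project brief also asks for dimensionality reduction, which this notebook approaches via feature selection instead. A PCA pass over the one-hot matrix would look roughly like this; the data below is a random stand-in and the component count of 10 is an illustrative assumption, not a tuned value:

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in for the one-hot encoded feature matrix X
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 50)).astype(float)

# project onto the top 10 principal components
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape)  # (100, 10)
print(round(pca.explained_variance_ratio_.sum(), 2))  # fraction of variance kept
```

In practice the component count would be chosen by inspecting the cumulative explained variance ratio.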
# Splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
# import the ML algorithm
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, Y_train)
linreg.coef_
linreg.intercept_
Y_test.shape
#Making predictions
# make predictions on the testing set
Y_pred = linreg.predict(X_test)
Y_pred.shape
# import libraries for metrics
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
from sklearn import neighbors
# Modelling
clf = neighbors.KNeighborsRegressor()
clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
Y_pred
Y_test
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf = clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
Y_pred
Y_test
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf = clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
Y_pred
Y_test
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
With the K-Fold cross-validator
K-Fold provides train/test indices to split data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation set while the remaining k - 1 folds form the training set.
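The fold behavior described above can be illustrated on a toy array of six samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(6).reshape(6, 1)

# without shuffling, folds are consecutive slices of the data
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in KFold(n_splits=3).split(X_toy)]
for tr, te in folds:
    print('train:', tr, 'test:', te)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Each sample lands in exactly one test fold, which is what makes the averaged score an honest estimate.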
from sklearn.model_selection import KFold
KF= KFold(n_splits=10)
Parameters of the RF regressor
clf = RandomForestRegressor(n_estimators=50, min_samples_split=0.1, max_depth=10, criterion='squared_error', max_features='sqrt')  # criterion was named 'mse' in older sklearn
clf = clf.fit(X_train, Y_train)
Evaluate a score by cross-validation
from sklearn.model_selection import cross_val_score
KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print('Mean Squared Error : ', -KFresult.mean())  # scores are negative MSE, so negate
# from F regression on numeric cols
dropcols
# columns with only one value in them
dropcols2=newdf[newdf['fscore'].isnull()].Colname.values
# from ANOVA on the categorical columns - X4 can be discarded
rawdata=pd.read_csv('./train/train.csv')
# dropping x4
rawdata = rawdata.drop(['X4'],axis=1)
# dropping the dropcols columns (p-value > 0.05)
rawdata = rawdata.drop(columns=list(dropcols))
# dropping the dropcols2 columns (single-valued)
rawdata = rawdata.drop(columns=list(dropcols2))
#One hot encoding
import pandas as pd
rawdata=pd.get_dummies(rawdata)
rawdata
# Collecting X and Y
X = rawdata.drop(['ID','y'],axis=1).values
Y = rawdata['y'].values
Y
# Splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
## KNN for regression
from sklearn import neighbors
# Modelling
clf = neighbors.KNeighborsRegressor(n_neighbors=17, metric='hamming', weights= 'distance')
clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
Parameters of KNN:
1. n_neighbors : int, default=5. Number of neighbors to use by default for kneighbors queries.
2. weights : {'uniform', 'distance'} or callable, default='uniform'. Weight function used in prediction; uniform weights are used by default.
3. algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'. Algorithm used to compute the nearest neighbors.
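The effect of the `weights` parameter can be seen on a tiny 1-D example (toy values, not from the dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

uni = KNeighborsRegressor(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
dist = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# query point 1.9 is very close to the sample at 2.0;
# distance weighting pulls the estimate toward that neighbor
print(uni.predict([[1.9]]))   # plain mean of the 3 nearest targets -> [1.]
print(dist.predict([[1.9]]))  # weighted toward the closest target, > 1
```

This is why `weights='distance'` was chosen above: with binary one-hot features and the hamming metric, closer configurations should count more.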
## KNN metrics
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
With K-Fold cross-validation
from sklearn.model_selection import cross_val_score
KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print('Mean Squared Error : ', -KFresult.mean())  # scores are negative MSE, so negate
## Lets try Decision Tree for regression
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
clf = clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
## Decision Tree metrics
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
## Lets try Random Forest for regression
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf = clf.fit(X_train, Y_train)
Y_pred=clf.predict(X_test)
## Random Forest metrics
# import libraries for metrics and reporting
import numpy as np
from sklearn import metrics
# Model evaluation metrics for regression
#print('y-intercept : ', linreg.intercept_)
#print('beta coefficients : ', linreg.coef_)
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('MAPE : ', np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
print('MPE : ', np.mean((Y_test - Y_pred) / Y_test) * 100)
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
## Tuning the Random Forest
from sklearn.model_selection import KFold
KF= KFold(n_splits=10)
clf = RandomForestRegressor(n_estimators=50, min_samples_split=0.1, max_depth=10, criterion='squared_error', max_features='sqrt')  # criterion was named 'mse' in older sklearn
clf = clf.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
KFresult=cross_val_score(estimator=clf,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print('Mean Squared Error : ', -KFresult.mean())  # scores are negative MSE, so negate
# Fit regression model
from sklearn.ensemble import GradientBoostingRegressor
params = {'n_estimators': 1500,
          'max_depth': 4,
          'min_samples_split': 2,
          'learning_rate': 0.005,
          'loss': 'squared_error'}  # named 'ls' in older sklearn
gbr = GradientBoostingRegressor(**params)
# Train GB regressor
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1, test_size=0.3)
gbr.fit(X_train, Y_train)
from sklearn.model_selection import cross_val_score
KFresult=cross_val_score(estimator=gbr,X=X,y=Y,cv=KF, scoring='neg_mean_squared_error')
print('Mean Squared Error : ', -KFresult.mean())  # scores are negative MSE, so negate
# Tuning Gradient Boost
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
KF = KFold(n_splits=5, shuffle=True, random_state=20)  # random_state requires shuffle=True
scor={'r2':'r2', 'MSE':'neg_mean_squared_error'}
scores=cross_validate(estimator=gbr,X=X_train,y=Y_train,cv=KF,scoring=scor,return_train_score=True)
#scores.keys()
print('Train MSE')
print(scores['train_MSE'].mean())
print('Train R2')
print(scores['train_r2'].mean())
print('-------------vs---------------')
print('Test MSE')
print(scores['test_MSE'].mean())
print('Test R2')
print(scores['test_r2'].mean())
!pip install xgboost
# train test split
X_train, X_test, Y_train, Y_test = train_test_split(X,
Y,
test_size=0.2,
random_state=123)
X_test
import xgboost as xgb
train = xgb.DMatrix(X_train,Y_train)
test = xgb.DMatrix(X_test, Y_test)
# parameters for tuning
params = {'max_depth': 7,
          'min_child_weight': 2,
          'eta': 0.005,
          'subsample': 0.8,
          'colsample_bytree': 1,
          'objective': 'reg:squarederror',  # 'reg:linear' is deprecated in recent XGBoost
          'eval_metric': 'mae'}
num_boost_round = 999
%%time
model = xgb.train(
params,
train,
num_boost_round=num_boost_round,
evals=[(test, "Test")],
early_stopping_rounds=10
)
# Predict
from sklearn import metrics
Y_pred = model.predict(train)
print("Training : metrics ...")
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_train, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_train, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_train, Y_pred)))
print('r2 value : ', metrics.r2_score(Y_train, Y_pred))
Y_pred = model.predict(test)
print('\n')
print("Testing : metrics ...")
print('Mean Abs Error MAE : ', metrics.mean_absolute_error(Y_test, Y_pred))
print('Mean Sq Error MSE : ', metrics.mean_squared_error(Y_test, Y_pred))
print('Root Mean Sq Error RMSE : ', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
print('r2 value : ', metrics.r2_score(Y_test, Y_pred))
An R2 of 0.64 and MSE of 58 on training versus an R2 of 0.61 and MSE of 58 on testing looks okay. Increasing the max depth to a higher number (e.g. 20) gives better results on the testing set, but the training score is poor. At a max depth of 7, together with the other tuning parameters, this trade-off gives a decent result.
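The brief's final step is predicting on the unseen test set. A minimal sketch of building a submission frame from model predictions; the IDs and prediction values below are simulated placeholders, where in the notebook `preds` would come from `model.predict()` on a DMatrix built from the preprocessed test features:

```python
import numpy as np
import pandas as pd

# placeholders: in practice these are test_df['ID'] and the XGBoost predictions
ids = np.array([1, 2, 3])
preds = np.array([98.5, 101.2, 95.7])

# assemble the ID -> predicted time pairs and write them out
submission = pd.DataFrame({'ID': ids, 'y': preds})
submission.to_csv('submission.csv', index=False)
print(submission.shape)  # (3, 2)
```

The same column-dropping, encoding, and dummy alignment applied to the training data must be applied to the test set first, so that the feature matrix matches what the model was trained on.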